Inventor Disambiguation Report

Author

PatentsView-Evaluation

Published

November 29, 2022

This report has been automatically generated for the following data source(s) at 10:06PM on November 29, 2022:

For more information, please refer to the PatentsView-Evaluation project homepage or to the report source code.

Summary Statistics

The plot below provides the number of inventors with a given number of (co-)authored patents.

We can read from the plot the number of inventors with a single authored patent, with exactly two authored patents, and so forth. This distribution of the number of authored patents per inventor is called the cluster sizes distribution of the disambiguation.
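As a minimal sketch, the cluster sizes distribution can be computed from a disambiguation given as a mapping from mention IDs to inventor IDs (the IDs below are hypothetical; actual disambiguation files may use a different layout):

```python
from collections import Counter

# Hypothetical disambiguation: mention ID -> disambiguated inventor ID.
disambiguation = {
    "pat1-inv1": "A", "pat2-inv1": "A",  # inventor A on two patents
    "pat3-inv2": "B",                    # inventor B on one patent
    "pat4-inv3": "C", "pat5-inv4": "C",  # inventor C on two patents
}

# Number of authored patents per inventor (cluster sizes).
cluster_sizes = Counter(disambiguation.values())

# Number of inventors with each cluster size (the cluster sizes distribution).
distribution = Counter(cluster_sizes.values())
print(distribution)  # Counter({2: 2, 1: 1})
```

The distribution's keys are cluster sizes and its values are inventor counts, which is exactly what the plot above displays.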

When comparing disambiguation results, look for shifts in the shape of the cluster sizes distribution. Is one of the distributions more skewed to the left than the others? This could indicate that one of the disambiguations favors smaller clusters, possibly resulting in higher precision but lower recall.

The table below provides the inventors with the largest number of authored patents.

Make sure to sort the table by number of patents.

When comparing disambiguation results, look for large changes in the ranking of inventors. Large changes in the estimated numbers of authored patents may also warrant investigating the behavior of the disambiguation for these prolific inventors.
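Given cluster sizes as a mapping from inventor ID to patent count, the most prolific inventors can be picked out with `Counter.most_common` (the inventor IDs and counts below are hypothetical):

```python
from collections import Counter

# Hypothetical cluster sizes: inventor ID -> number of authored patents.
cluster_sizes = Counter({"A": 12, "B": 3, "C": 7})

# Inventors sorted by number of patents, largest first.
top = cluster_sizes.most_common(2)
print(top)  # [('A', 12), ('C', 7)]
```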

The Hill Numbers entropy curve is a characterization of the cluster sizes distribution.

It is based on Hill Numbers of order \(q\), which are exponentiated Rényi \(q\)-entropies. That is, for a given \(q > 0\) and for \(p_i\) the proportion of inventors with \(i\) authored patents, the corresponding Hill Number is defined as \[ H_q = \left ( \sum_{i=1}^{\infty} p_i^{q}\right )^{1/(1-q)}. \] The Hill Numbers entropy curve is the plot of Hill Numbers as a function of \(q > 0\).

For \(q=0\), the Hill Number is defined as the number of indices \(i\) such that \(p_i > 0\). This is the size of the support of the cluster sizes distribution \[ H_0 = \# \left\{ i > 0 \,:\, p_i > 0 \right\}. \]

For \(q=1\), the Hill Number is defined as the exponentiated Shannon entropy \[ H_1 = \exp \left ( - \sum_{i=1}^{\infty} p_i \log p_i \right ). \]

For \(q=2\), the Hill Number is the inverse of the probability that two randomly sampled inventors have the same number of authored patents: \[ H_2 = \left ( \sum_{i=1}^{\infty} p_i^2 \right )^{-1}. \]

When comparing disambiguation results, look for major relative differences between entropy curves. These represent differences in the cluster sizes distribution which can be further investigated using the cluster sizes distribution plot.

Higher Hill Numbers represent a more spread-out cluster sizes distribution, while lower values represent more peaked distributions. The order \(q\) of the Hill Numbers determines how the cluster sizes proportions \(p_i > 0\) are weighted. At \(q = 0\), we have the number of distinct cluster sizes in the data. As \(q \rightarrow \infty\), the Hill Number tends to \(1/\max_i p_i\), the inverse of the proportion of inventors with the most common cluster size.
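Under the definitions above, Hill Numbers can be sketched in a few lines (the proportions used for illustration are made up):

```python
import math

def hill_number(p, q):
    """Hill Number H_q for cluster size proportions p (assumed to sum to 1)."""
    p = [x for x in p if x > 0]
    if q == 0:
        return len(p)  # size of the support of the distribution
    if q == 1:
        # Limit as q -> 1: exponentiated Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1 / (1 - q))

p = [0.5, 0.25, 0.25]
print(hill_number(p, 0))  # 3 (three distinct cluster sizes)
print(hill_number(p, 2))  # 1 / (0.25 + 0.0625 + 0.0625) ≈ 2.667
```

Note that \(q = 1\) is handled as a special case, since the general formula has a removable singularity there.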

Cluster homogeneity is the level of similarity among a cluster’s elements. For inventor disambiguation, this is about the variation in how an inventor is represented across their patents (e.g., different name spellings).

Here we look at cluster homogeneity from a binary perspective – whether or not there is variation, within a cluster, in how an inventor’s name is spelled. The proportion of inventors with a unique name (no name variation within their cluster) is our metric of cluster homogeneity, called the homogeneity rate.

In the plot below, the homogeneity rate (i.e., clusters with no name variation) is plotted as a function of cluster size. For inventors with a single patent, the proportion of homogeneous clusters is trivially 1. For inventors with two patents, we can read off the proportion of them with no name variation, and so forth.

When comparing two disambiguation results, look for changes in the homogeneity rate across cluster sizes. A higher homogeneity rate means possibly smaller, more robust clusters. On the other hand, lower homogeneity rates may be associated with an increased error probability.
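A sketch of the homogeneity rate by cluster size, assuming mentions are available as (inventor ID, name spelling) pairs (all data below is hypothetical):

```python
from collections import defaultdict

# Hypothetical mentions: (disambiguated inventor ID, name as spelled on the patent).
mentions = [
    ("A", "J. Smith"), ("A", "John Smith"),  # cluster A: name variation
    ("B", "Mary Lee"), ("B", "Mary Lee"),    # cluster B: homogeneous
    ("C", "Ana Diaz"),                       # singleton: trivially homogeneous
]

# Collect the name spellings in each cluster.
names = defaultdict(list)
for inventor, name in mentions:
    names[inventor].append(name)

# For each cluster size, count homogeneous clusters and total clusters.
by_size = defaultdict(lambda: [0, 0])
for spellings in names.values():
    size = len(spellings)
    by_size[size][1] += 1
    by_size[size][0] += len(set(spellings)) == 1  # one unique spelling?

homogeneity_rate = {s: h / n for s, (h, n) in by_size.items()}
print(homogeneity_rate)  # {2: 0.5, 1: 1.0}
```

This reproduces the behavior described above: singletons are trivially homogeneous, and larger clusters have a rate between 0 and 1.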

Between-cluster similarity is the level of similarity between different clusters. For inventor disambiguation, this is about different inventors that have similar representations on some patents, such as having the same names.

Here we look at between-cluster similarity from a binary perspective – whether or not an inventor’s name is shared with another inventor. The proportion of inventors sharing their name with someone else is our metric of between-cluster similarity, which we call the homonymy rate.

In the plot below, the homonymy rate is plotted as a function of cluster size.
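The overall homonymy rate can be sketched from a table of (inventor ID, canonical name) pairs (the names below are hypothetical):

```python
from collections import Counter

# Hypothetical disambiguated inventors with their canonical names.
inventors = [
    ("A", "John Smith"), ("B", "John Smith"),  # homonyms: shared name
    ("C", "Mary Lee"),
]

# Count how many distinct inventors carry each name.
name_counts = Counter(name for _, name in inventors)

# Proportion of inventors whose name is shared with at least one other inventor.
homonyms = sum(1 for _, name in inventors if name_counts[name] > 1)
homonymy_rate = homonyms / len(inventors)
print(homonymy_rate)  # 2/3 ≈ 0.667
```

Breaking this down by cluster size, as in the plot, would only require additionally grouping inventors by their number of authored patents.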